AITopics | wrong answer

Collaborating Authors

wrong answer

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Appendix

Neural Information Processing Systems, Feb-17-2026, 22:42:43 GMT

Literal Interpretation: One of the limitations is the model's tendency to interpret questions

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Europe > France (0.04)
Asia > India > West Bengal (0.04)
Africa > Nigeria (0.04)
(2 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.81)

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI Supplementary Materials Contents

Neural Information Processing Systems, Feb-17-2026, 08:35:59 GMT

In this section, we describe the evaluation process in detail. We evaluated various LVLMs, including medical-specific models, open-source general models, and closed-source general models accessed through APIs.

green highlight, large language model, machine learning, (21 more...)

Neural Information Processing Systems

Industry:

Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)
Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

Neural Information Processing Systems, Feb-17-2026, 08:35:57 GMT

Before that, it is crucial to develop benchmarks to evaluate LVLMs' effectiveness in various medical applications.

large language model, machine learning, question answering, (22 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Shanghai > Shanghai (0.04)
(11 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
(9 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(5 more...)

A Diagnostic Benchmark for Very Long-form Video Language Understanding

Neural Information Processing Systems, Feb-15-2026, 21:24:53 GMT

To remedy this, we introduce temporal certificate sets, a general notion for capturing the intrinsic temporal understanding length associated with a broad range of video understanding tasks & datasets.
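The snippet does not say how the length of a certificate set is measured. A minimal sketch, assuming a certificate set is a list of (start, end) time spans in seconds that together suffice to verify the answer, and that its length is the total time those spans cover with overlaps counted once. The function names and the sample spans are hypothetical, not the benchmark's code or data.

```python
from statistics import median
from typing import Iterable, List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds; assumed representation


def certificate_length(certificate: Iterable[Interval]) -> float:
    """Total time covered by a certificate set; overlapping spans count once."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(certificate):
        if end < start:
            raise ValueError(f"interval ends before it starts: {(start, end)}")
        if cur_end is None or start > cur_end:
            # Disjoint from the current run: close it and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps or touches the current run: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def median_certificate_length(dataset: List[List[Interval]]) -> float:
    """Dataset-level summary: median certificate length over all questions."""
    return median(certificate_length(c) for c in dataset)


questions = [
    [(10, 25), (20, 40), (100, 110)],  # merges to [10, 40] plus [100, 110]
    [(0, 5)],
    [(30, 60), (60, 90)],              # touching spans merge to [30, 90]
]
print([certificate_length(c) for c in questions])  # [40.0, 5.0, 60.0]
print(median_certificate_length(questions))        # 40.0
```

Merging overlaps keeps a span that two certificate clips share from being counted twice, so the measure reflects how much of the video a model must actually attend to.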

large language model, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > Bangladesh (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.67)

Industry:

Leisure & Entertainment (0.67)
Law (0.67)
Information Technology > Security & Privacy (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

PARROT: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs

Çelebi, Yusuf, Ezerceli, Özay, Hussieni, Mahmoud El

arXiv.org Artificial Intelligence, Dec-2-2025

This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness-focused framework designed to measure how much the accuracy of large language models (LLMs) degrades when users apply social pressure through authority and persuasion, i.e., the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of a question with an authoritatively false version of the same question using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and the imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction) using an eight-state behavioral taxonomy. We evaluated 22 models on 1,302 MMLU-style multiple-choice questions spanning 13 domains, with domain-specific authority templates. The findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low "follow rates" (≤ 11%; GPT-5: 4%) and minimal accuracy loss, while older or smaller models show severe epistemic collapse (GPT-4: 80%, Qwen 2.5-1.5B: 94%). The danger is not limited to changed answers: weak models reduce their confidence in the correct response while increasing their confidence in the imposed incorrect response. At the domain level, international law and global knowledge are highly fragile, whereas elementary mathematics is relatively resilient. Consequently, we argue that "resistance to overfitting pressure" should be treated as a primary objective alongside accuracy, harm avoidance, and privacy for safe real-world deployment.
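The abstract names two headline quantities, the follow rate and the log-likelihood-based confidence shift, without giving formulas. A minimal sketch, assuming the follow rate is the share of questions whose pressured answer equals the imposed false option, and that per-option log-likelihoods are renormalised with a softmax before the neutral and pressured runs are compared. `PressureItem`, `option_probs`, `summarize`, and the sample numbers are hypothetical; this is not PARROT's code and the values are not its results.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PressureItem:
    """One question answered twice: neutrally and under false-authority pressure."""
    correct: str                      # gold option, e.g. "A"
    imposed: str                      # false option asserted by the "authority"
    neutral_answer: str               # model's choice for the neutral prompt
    pressured_answer: str             # model's choice for the pressured prompt
    neutral_loglik: Dict[str, float]  # option -> log-likelihood, neutral prompt
    pressured_loglik: Dict[str, float]


def option_probs(loglik: Dict[str, float]) -> Dict[str, float]:
    """Renormalise per-option log-likelihoods into a distribution (softmax)."""
    m = max(loglik.values())
    exps = {k: math.exp(v - m) for k, v in loglik.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}


def summarize(items: List[PressureItem]) -> Dict[str, float]:
    n = len(items)
    # Follow rate (assumed definition): pressured answer == imposed false option.
    follow_rate = sum(it.pressured_answer == it.imposed for it in items) / n
    acc_neutral = sum(it.neutral_answer == it.correct for it in items) / n
    acc_pressured = sum(it.pressured_answer == it.correct for it in items) / n
    shift_correct = shift_imposed = 0.0
    for it in items:
        before = option_probs(it.neutral_loglik)
        after = option_probs(it.pressured_loglik)
        shift_correct += after[it.correct] - before[it.correct]
        shift_imposed += after[it.imposed] - before[it.imposed]
    return {
        "follow_rate": follow_rate,
        "accuracy_drop": acc_neutral - acc_pressured,
        "mean_shift_correct": shift_correct / n,
        "mean_shift_imposed": shift_imposed / n,
    }


items = [
    # Flips to the imposed answer and moves probability mass toward it.
    PressureItem("A", "C", "A", "C",
                 {"A": -0.1, "B": -3.0, "C": -2.5, "D": -3.0},
                 {"A": -2.0, "B": -3.0, "C": -0.2, "D": -3.0}),
    # Keeps the right answer but still loses some confidence in it.
    PressureItem("B", "D", "B", "B",
                 {"A": -3.0, "B": -0.1, "C": -3.0, "D": -2.5},
                 {"A": -3.0, "B": -0.3, "C": -3.0, "D": -1.5}),
]
report = summarize(items)
print(report)
```

The second item shows why the abstract tracks confidence separately from answers: the chosen option does not change, yet probability still drains from the correct option toward the imposed one.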

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.1722

Country: Asia > Middle East (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology (0.68)
Education (0.66)
Law > International Law (0.55)
Health & Medicine > Therapeutic Area > Endocrinology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

ab7e02fd60e47e2a379d567f6b54f04e-Supplemental-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems, Oct-10-2025, 12:54:02 GMT

green highlight, right answer, table index, (15 more...)

Neural Information Processing Systems

Industry:

Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)
Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)

ab7e02fd60e47e2a379d567f6b54f04e-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems, Oct-10-2025, 12:54:00 GMT

dataset, green highlight, segmentation, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Shanghai > Shanghai (0.04)
(11 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
(9 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(5 more...)

5f2809607f692d79a01c05c43d702883-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems, Oct-10-2025, 04:07:48 GMT

benchmark, camera wearer, video, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.05)
North America > United States > California > Los Angeles County > Long Beach (0.04)
North America > United States > Maryland > Baltimore (0.04)
(2 more...)

Genre: Research Report > New Finding (0.45)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(6 more...)

Neologism Learning for Controllability and Self-Verbalization

Hewitt, John, Tafjord, Oyvind, Geirhos, Robert, Kim, Been

arXiv.org Artificial Intelligence, Oct-10-2025

Humans invent new words when there is a rising demand for a new useful concept (e.g., doomscrolling). We explore and validate a similar idea in our communication with LLMs: introducing new words to better understand and control the models, expanding on the recently introduced neologism learning. This method introduces a new word by adding a new word embedding and training it on examples that exhibit the concept, with no other changes to the model parameters. We show that adding a new word allows control of concepts such as flattery, incorrect answers, and text length, as well as more complex concepts in AxBench. We find that neologisms can also further our understanding of the model via self-verbalization: models can describe in natural language what each new word means to them, for example explaining that a word representing the concept of incorrect answers means "a lack of complete, coherent, or meaningful answers..." To validate self-verbalizations, we introduce plug-in evaluation: we insert the verbalization into the context of a model and measure whether it controls the target concept. In some self-verbalizations, we find machine-only synonyms: words that seem unrelated to humans but cause similar behavior in machines. Finally, we show how neologism learning can jointly learn multiple concepts in multiple words.
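The core mechanism described above (append one embedding row, train only that row, leave every other parameter frozen) can be illustrated on a toy model. A minimal sketch in plain Python, assuming a hypothetical frozen "model" that mean-pools token embeddings and applies a linear readout. The names `predict`, `add_neologism`, and `train_neologism`, the mean-row initialisation, and the toy data are illustrative assumptions, not the authors' implementation, which trains a new token inside an actual LLM.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def predict(emb: List[Vector], W: List[Vector], tokens: Sequence[int]) -> Vector:
    """Toy frozen 'model': mean-pool token embeddings, then a linear readout."""
    d = len(emb[0])
    h = [sum(emb[t][i] for t in tokens) / len(tokens) for i in range(d)]
    return softmax([sum(w * x for w, x in zip(row, h)) for row in W])


def add_neologism(emb: List[Vector]) -> int:
    """Append one new embedding row (initialised to the mean row); return its id."""
    d = len(emb[0])
    emb.append([sum(e[i] for e in emb) / len(emb) for i in range(d)])
    return len(emb) - 1


def mean_nll(emb, W, examples) -> float:
    return -sum(math.log(predict(emb, W, toks)[y]) for toks, y in examples) / len(examples)


def train_neologism(emb, W, new_id, examples, lr=0.5, steps=200) -> None:
    """Gradient descent on cross-entropy that updates ONLY emb[new_id]."""
    d = len(emb[0])
    for _ in range(steps):
        grad = [0.0] * d
        for toks, y in examples:
            p = predict(emb, W, toks)
            # dL/dlogits = p - onehot(y); dL/dh = W^T (dL/dlogits)
            dlogits = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
            dh = [sum(W[k][i] * dlogits[k] for k in range(len(W))) for i in range(d)]
            # h is a mean, so each occurrence of new_id carries weight 1/len(toks)
            share = toks.count(new_id) / len(toks)
            for i in range(d):
                grad[i] += share * dh[i] / len(examples)
        for i in range(d):
            emb[new_id][i] -= lr * grad[i]


rng = random.Random(0)
V, d = 5, 4                                    # toy vocabulary size and width
emb = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(V)]
W = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(V)]  # frozen readout
original = [row[:] for row in emb]

new_id = add_neologism(emb)
concept = 3                                    # hypothetical "concept" output token
examples = [([0, new_id], concept), ([1, new_id], concept), ([2, new_id], concept)]

loss_before = mean_nll(emb, W, examples)
train_neologism(emb, W, new_id, examples)
loss_after = mean_nll(emb, W, examples)
print(loss_before, loss_after)
```

Because only the new row receives updates, the original vocabulary embeddings and the readout stay exactly as they were, which mirrors the "no other changes in model parameters" constraint in the abstract.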

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2510.08506

Country: Europe (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
